A high-quality assembled genome and its comparative analysis decode the adaptive molecular mechanism of the number one Chinese cotton variety CRI-12

Abstract Background Gossypium hirsutum L. is the most widely cultivated cotton species, and a high-quality reference genome would be a huge boost for researching the molecular mechanism of agronomic traits in cotton. Findings Here, Pacific Biosciences and Hi-C sequencing technologies were used to assemble a new upland cotton genome of the No. 1 Chinese cotton variety CRI-12. We generated a high-quality assembled CRI-12 genome of 2.31 Gb with a contig N50 of 19.65 Mb, which was superior to previously reported genomes. Comparisons between CRI-12 and other reported genomes revealed 7,966 structural variations and 7,378 presence/absence variations. The distribution of the haplotypes among A-genome (Gossypium arboreum), D-genome (Gossypium raimondii), and AD-genome (G. hirsutum and Gossypium barbadense) suggested that many haplotypes were lost and recombined in the process of polyploidization. More than half of the haplotypes that correlated with different tolerances were located on chromosome D13, suggesting that this chromosome may be important for wide adaptation. Finally, it was demonstrated that DNA methylation may provide advantages in environmental adaptation through whole-genome bisulfite sequencing analysis. Conclusions This research provides a new reference genome for molecular biology research on Gossypium hirsutum L. and helps decode the broad environmental adaptation mechanisms in the No. 1 Chinese cotton variety CRI-12.

clustering is an important aspect of evolutionary analysis, which is also associated with biological 172 characteristics. In our research, 22,854 gene families were shared by CRI-12, Gossypium hirsutum 173 TM-1, Gossypium barbadense, Gossypium mustelinum and Gossypium darwinii, while 555 gene 174 families were retained only by CRI-12 (Fig. 3b). Based on the gene family clustering analysis, in 175 order to investigate the expansion and contraction of gene members in CRI-12 and other cotton 176 species, six gene families, included MYB, WRKY, DREB, bZIP, NAC and AP2, were finally 177 selected. The results showed three gene families were expanded and 40 gene families were 178 contracted among 1,188 gene families shared in MRCA (most recent common ancestor) analysis 179 (Fig. 3c). Among the three expanded gene families, 36 genes were discovered, mainly located on 180 chromosomes A08, D08, A11 and D11. The discovery of abundant stress-related gene families and 181 genes demonstrates the utility of the newly assembled CRI-12 genome as a reference genome for 182 research on stress-related molecular biology. 183 Positive selection refers to a single copy gene family, in which a gene is affected by 184 environmental or human factors in the process of evolution, and non-synonymous mutation occurs 185 at the amino acid level to adapt to environmental changes. The probability of positive selection is 186 detected by calculating Ka/Ks using the maximum likelihood ratio. In this study, CRI-12 was used 187 as the foreground branch, and upland cotton, island cotton, wool cotton, yellow brown cotton and 188 Darwin's cotton were used as the background branches. Multiple sequence alignments of protein 189 sequences from single copy gene families were performed using MUSCLE software. For each 190 gene family, the branch-site model of the Codeml tool in PAML (phylogenetic analysis by 191 maximum likelihood, a package of programs for phylogenetic analyses of DNA and protein 192 sequences using maximum likelihood) was used to detect whether the gene family was positively 193 selected in the CRI-12 branch. In PAML, instead of simply searching for genes with the Ka/Ks 194 ratio >1, positive selection is determined by likelihood ratio tests of the two hypotheses. Finally,195 by likelihood ratio detection, 384 candidate genes were identified in CRI-12 ( Fig. 3d and  196 Supplementary Table S11). Based on the selected genes, results with a P value < 0.05 were 197 filtered out according to Fisher's exact test, and 63 and 27 significant pathways were obtained by 198 Gene Ontology (GO) and KEGG enrichment analysis, respectively (Supplementary Table S12  199 and Supplementary Table S13). One of the enriched GO terms "intracellular" (GO: 0005622, P 200 ≪ 0.01) contains many stress-related genes, including many MYB transcription factors [17], 201 cytochrome P450 genes [18], and E3 ubiquitin-protein [19], which were reported to be closely 202 correlated with multiple tolerances in cotton. The results showed that many genes correlated with 203 the cell membrane were reasonably important for environmental adaptation. 204

Major structural changes compared with other cultivated genomes 205
High-quality reference genomes provide a basis for accurate genome-wide structural variations, 206 which are closely correlated with multiple agronomic traits between different species. In our 207 research, a total of 7,966 structural variations (SVs) were identified with an average length of 208 48,791 bp compared with other cotton species, including Gossypium hirsutum TM-1, Gossypium 209 barbadense, Gossypium mustelinum and Gossypium darwinii (Supplementary Table S14). 210 Among all SVs, 46.16% (3 677) were deletion variations (DELs) and 40.57% (3 232) were 211 insertion variations (INS), while only a small proportion of 8.29% and 4.98% were copy number 212 variants (CNV) and inversion variations (INV), respectively, which suggested that deletion 213 variations and insertion variations were two main drivers in the differentiation process of cotton 214 species. The largest structural variation was located on chromosome D11. In addition, 7,379 PAVs 215 were obtained with the lengths ranging from 51 bp to 2,452,232 bp and the average length was 216 Table S15). The largest PAV was located on the D02 chromosome. 217

24,585 bp (Supplementary
The results also showed that the percentage of variations of both SVs and PAVs on the Dt 218 subgenome was lower than that on the At subgenome. In particular, there were lesser PAV 219 variations (13.80%) in the Dt subgenome than that in the At subgenome, indicating that variations 220 in the A subgenome was the main reason for the difference in agronomic characteristics. 221 We also investigated the SV and PAV variations on each chromosome, and the results showed 222 that chromosome D01 covered 529 SVs (7.17%, mainly containing copy number variants, deletion 223 variations, insertion variations and inversion variations) and 363 PAVs (4.56%), which was the 224 most compared with PAVs on other chromosomes. GO enrichment analysis of PAV-related genes 225 ( Fig. 4a) showed that molecular transducer activity (GO: 0060089, p<0.01), signaling receptor 226 activity (GO: 0038023, p<0.05), and the G-protein-coupled receptor signaling pathway (GO: 227 0007186, p<0.05) were three main terms while SV-related genes ( Fig. 4b) were mainly enriched 228 in organelle (GO: 0043226, p<0.01), intracellular non-membrane-bounded organelle (GO: 229 0043232, p<0.001), and non-membrane-bounded organelle (GO: 0043228, p<0.0001), which were 230 all belonged to cellular components. Pathway enrichment analysis of SV-and PAV-related genes 231 showed that most variation-related genes were correlated with organelles, signaling receptor and 232 molecular transducers, indicating that the evolution and differences of organelles, signal reception 233 and transduction related genes may be important factors leading to the great differences of 234 agronomic traits among different cotton varieties (Fig. 4c, d). In contrast, several chromosomes 235 contained less variation, e.g., D03 (138 SVs and 124 PAVs), D04 (169 SVs and 115 PAVs) and 236 D13 (189 SVs and 129 PAVs), which indicated that two chromosomes may be conserved for 237 containing many fundamental growth related genes in the long-term evolution process of cotton. 238 Based on our previous research, 420 genes were obtained by selective sweep analysis in 239 CRI-12, including 2, 2 and 20 haplotype blocks correlated with V. wilt, salt-and drought-240 tolerance, respectively [12]. Among these haplotype blocks, more than half (13/24) were located 241 on chromosome D13. Interestingly, these 12/13 haplotype blocks were correlated with 242 drought-tolerance (Supplementary Table S16), indicating that D13 chromosome played a crucial 243 role in the process of drought resistance adaptation of cotton varieties. In addition, another 244 haplotype block (M2: ATCTCGCATGTAGAGTTCAT CCGGTAGAAACCGTTTTACAT) was 245 also found to be correlated with Verticillium wilt, suggesting that chromosome D13 may be 246 important for the formation process of multiple tolerances. 247

Strong haplotypes were discovered in the polyploidization and evolution of diploid cottons 248
A haplotype means a group of genes that have a close linkage relationship in an organism and 249 these haplotypes could be inherited from parents to their descendants. In our previous research, 250 haplotype polymorphisms in CRI-12 and its descendants and different reported genomes (G. 251 arboreum, G. raimondii, G. hirsutum, and G. barbadense) were investigated [12]. All 252 allotetraploid cotton species came from a single polyploidization event between the A-genome and 253 D-genome approximately 1-2 million years ago [11,20]. In the polyploidization process, two 254 diploid genomes were hybridized into a tetraploid genome, along with the fusion and 255 recombination of haplotypes in each diploid cotton. With that in mind we investigated the 256 haplotype polymorphisms in the A-genome (G. arboreum), D-genome (G. raimondii), TM-1 (G. 257 hirsutum), and Hai7124 (G. barbadense), and obtained 31 769, 37,177, 51,682, 51,023 haplotypes, 258 respectively (Supplementary Table S17). A total of 56,267 haplotypes were discovered in 259 which was the highest number in the different cotton species. Alongside G. hirsutum and G. 260 barbadense, in CRI-12 the number of haplotypes was smaller than the sum of the A-genome (G. 261 arboreum) and D-genome (G. raimondii), indicating that more than 10,000 haplotypes were lost 262 or recombined in the process of polyploidization (Fig. 5). Comparisons between G. hirsutum 263 CRI-12 and G. barbadense Hai7124 showed that the number of haplotypes between G. hirsutum 264 CRI-12 and G. barbadense Hai7124 was approximately 10% more than that between G. hirsutum 265 TM-1 and G. barbadense Hai7124, suggesting that the haplotype polymorphism in CRI-12 was 266 more abundant than other tetraploid cotton species, which may be correlated with a great deal of 267 human selection and strong haplotypes in the breeding process in CRI-12. we revealed the haplotype inheritance and recombination of agronomically important genes in 276 artificial selection [12], and combined with the haplotypes identified before, a total of 66 277 differentially methylated haplotypes were found in the CRI-12 family (Supplementary Table  278 S18). Among these haplotypes, six were derived from its female parent Uganda4 and 19 279 haplotypes were derived from its male parent Xingtai6871, indicating that the greater contribution 280 of DNA methylation haplotypes was from the male parent Xingtai6871 as it was a domestic 281 variety while Uaganda4 was a foreign variety. Approximately 12.12% (8/66) of DNA methylation 282 haplotypes were enriched on chromosome D13, suggesting that DNA methylation variations on 283 D13 may play a crucial role in the regulatory mechanism of haplotypes in CRI-12. In addition, 284 methylation types in each haplotype were studied, and the results showed that six haplotypes were 285 labeled as CG-up methylation under both drought and salt treatment ( Supplementary Fig. S7), 286 which indicated that DNA methylation variations in these haplotypes may provide adaptive 287 advantages in responding to different stresses (Fig. 6). 288

Discussion 289
In this study, we performed de novo assembly of the CRI-12 genome by integrating multiple sets 290 of data from the PacBio platform, 110× genome equivalent sequencing, and Hi-C technology. All 291 these results indicated substantial improvements to the contiguity and accuracy of assembly, with 292 a significant enhancement in the assembly of centromeres. By comparing the two high-quality 293 genome assemblies (CRI-12 and Gossypium hirsutum ZM24), 7,966 SVs (accounting for 12.65% 294 of the assembled genome) and 7,378 PAVs (accounting for 17.85% of the assembled genome) 295 were obtained between different species, demonstrating these large variations projected 296 differences in traits and species differentiation. Structural variations are generally considered to be 297 relatively large variations and stable, hence, SV-related genes may be the main cause for the 298 differences in characteristics. Chromosomes D03, D04 and D13 contained fewer than 200 SVs, 299 significantly lower than other chromosomes, suggesting that these chromosomes are relatively 300 conserved. Cotton polyploidization was a crucial event in cotton history, and tracking the 301 haplotype mechanism is beneficial for understanding the cotton evolution. In addition, strong 302 haplotypes contained in G. hirsutum CRI-12 suggested intense human selection and domestication 303 occurred during the breeding process. 304 CRI-12, a cotton variety well known by every cotton breeder in China, have been repeatedly 305 used for breeding new cotton varieties, and haplotype blocks inheritance and recombination of 306 agronomically important genes is likely one of the main reasons that CRI-12 is such an effective 307 breeding parent. The whole-genome scale methylation map of CRI-12 suggested that DNA 308 methylation variations may be closely correlated with the haplotype block inheritance and 309 recombination. To our knowledge, this is the first genome map of the widely cultivated upland 310 cotton CRI-12, which could provide more understanding of crop domestication, evolution and 311 diversification and the discovery of novel domestication-related genes conferring agronomically 312 beneficial traits in future breeding programs. In addition, it demonstrates the suitability of 313 selecting the CRI-12 genome as a new reference genome for its many suitable indicators. 314

Conclusion 315
In this study multiple sequencing techniques and analytical methods were used to assemble a new 316 upland cotton genome of the most popular Chinese cotton variety CRI-12. The newly assembled 317 CRI-12 genome was approximately 2.30 Gb in length and several benchmarks reveal its improved 318 quality to other reported genomes. This research providing a better quality alternative for cotton 319 researchers when selecting a reference genome. 320

Plant materials 322
G. hirsutum L. acc. CRI-12 was selected for genome assembly because of its excellent 323 performance and wide influence on cotton breeding and genetic research in China and many other 324 cotton-growing countries across the globe. CRI-12 seeds were planted in the greenhouse for 20 325 days at the Institute of Cotton Research of Chinese Academy of Agricultural Science and young 326 leaves from a single plant were harvested and snap frozen with liquid nitrogen for the extraction of 327 genomic DNA [22]. In addition, root, stem and flower tissues were used for transcriptome 328 sequencing for genome annotation work using three replicates. 329

Annotation of repeats 340
After genome assembly, repeat annotation was performed for the CRI-12 genome. Repetitive 341 sequences include transposable elements (TEs) and tandem repeats. Two approaches were used to 342  The chromatin extraction method was similar to that used in the DNase I digestion 359 experiment. The procedures were similar to those described previously. Briefly, chromatin was 360 digested for 16 h with 400 U HindIII restriction enzyme (NEB) at 37 °C. DNA ends were labeled 361 with biotin and incubated at 37 °C for 45 min, and the enzyme was inactivated with 20% SDS 362 solution. DNA ligation was performed by the addition of T4 DNA ligase (NEB) and incubation at 363 16 °C for 4~6 hours. After ligation, proteinase K was added to reverse cross-linking during 364 incubation at 65 °C for overnight. DNA fragments were purified and dissolved in 86 μL of 365 ultrapure water. Unligated ends were then removed. Purified DNA was fragmented to a size of 366 300-500 bp, and DNA ends were then repaired. DNA fragments labeled by biotin were finally 367 separated on Dynabeads® M-280 Streptavidin (Life Technologies). Hi-C libraries were 368 controlled for quality and sequenced on an Illumina HiSeq X Ten sequencer (Illumina HiSeq X 369 Ten, RRID:SCR_016385). 370

Hi-C library preparation and sequencing 371
Following the standard protocol previously described with certain modifications [27], we 372 constructed Hi-C libraries using the CRI-12 seedlings as inputs (20-day-old, detailed culturing 373 conditions were described in the main text). After being ground in liquid nitrogen, seedling tissues 374 were cross-linked with 4% formaldehyde solution at room temperature in a vacuum for 30 mins. 375 2.5 M of glycine was added to quench the crosslinking reaction for 5 min and then placed on ice 376 for 15 min. The sample was centrifuged at 2500 rpm at 4 °C for 10 min, and the pellet was washed 377 with 500 μl of PBS and then centrifuged for 5 min at 2500 rpm. The pellet was resuspended in 20 378 μl lysis buffer (1 M Tris-HCl, pH 8.0, 1 M NaCl, 10% CA-630, and 13 units protease inhibitor), 379 and then the supernatant was centrifuged at 5000 rpm at room temperature for 10 min. The pellet 380 was washed twice in 100 μl of ice cold 1x NEB buffer and then centrifuged for 5 min at 5000 rpm. 381 The nuclei were resuspended in 100 μl of NEB buffer and solubilized with dilute SDS followed by 382 incubation at 65 °C for 10 min. After quenching the SDS with Triton X-100, a 4-cutter restriction 383 enzyme MboI (400 units) was applied for overnight digestion at 37 °C on a rocking platform. 384 The following steps were involved in marking the DNA ends with biotin-14-dCTP and 385 blunt-end ligation of the cross-linked fragments. The proximal chromatin DNA was religated by 386 ligation enzyme. The nuclear complexes were reversely cross-linked by incubation with proteinase 387 K at 65 °C. DNA was purified by phenol-chloroform extraction. Biotin was removed from 388 nonligated fragment ends using T4 DNA polymerase. Ends of fragments sheared by sonication 389

Hi-C assisted assembly 395
The Hi-C technique obtains information about interactions between DNA fragments that are 396 spatially connected, that is, physically distant. Different contigs or scaffolds were divided into 397 different chromosomes according to the rule that the interaction probability within chromosomes 398 was significantly higher than the interaction probability between chromosomes. The contigs or 399 scaffolds on the same chromosome are sequenced and orientated base on the interaction 400 probability decreasing with the increase of the interaction distance on the same chromosome.

Sequence quality checking and filtering 414
To avoid reads with artificial bias, we removed the following type of reads: (a) reads with ≥ 10% 415 unidentified nucleotides (N); (b) reads with > 10 nt aligned to the adapter, allowing ≤ 10% 416 mismatches; (c) reads with > 50% bases having phred quality < 5; and (d) putative PCR duplicates 417 generated by PCR amplification in the library construction process. 418

Haplotype analysis (alignment, variant calling, HapCUT analysis) 419
The high-quality paired-end Hi-C reads were first mapped to the reference genome using 420 Burrows-Wheeler Aligner (BWA) software [28]. Alignment files were converted to BAM files 421 using SAMtools [29], and then the alignment results were improved as follows: (a) filter the 422 alignment read with mapping quality = 0; (b) sort the BAM files as physical coordinate; (c) 423 remove potential PCR duplications. If multiple read pairs have identical external coordinates, only 424 the pair with the highest mapping quality retained. (d) Local InDel realignment was performed. 425 The filtered BAM files of CRI-12 leaves were used as input for variant calling using the 426 Genome Analysis Toolkit version 3.1.1 (GATK) [30]. SNPs and InDels were remained if the 427 depth of alternative variants was above 2 and the genotype quality was more than 20. 428 We used the modified version of the HapCUT [31] algorithm to perform haplotype imputation 429 for each individual. HapCUT constructs a graph with the heterozygous variants as nodes and 430 DNA fragment (s) connecting two nodes as edges. Therefore, only fragments with at least two 431 heterozygous variants are useful for haplotype phasing. HapCUT extracts such 432 "haplotype-informative" BAM files using a sorting method that stores each potential 433 haplotype-informative read in a buffer until its mate is seen. We set 30 Mb as the maximum 434 "insert size" for a paired-end read to be considered as a single fragment for phasing. We used a maximum of 1000 iterations to find the maximum cut for each haplotype block in a 440 given iteration in CRI-12. 441

Hi-C read mapping and filtering and generation of contact matrices 442
For the Hi-C experiment, chromatin was crosslinked with formaldehyde, then digested, and 443 religated to capture 3D interactions. In principle, interactions within chromosomes are more 444 frequent than those among chromosomes, and the intrachromosome interaction frequency decays 445 with the increasing distance. Contigs/scaffolds are thus ordered and oriented. The high quality 446 paired-end Hi-C reads were mapped to mm10 and filtered using HiCUP (HiCUP, 447 RRID:SCR_005569) [32]. The first stage in the HiCUP pipeline involves truncating reads at the 448 enzyme digestion ligation site (HindIII in our experiment) that separates two DNA fragments. 449 After truncation, the resulting trimmed forward and reverse reads were sent for alignment by 450 Bowtie2 software (Bowtie 2, RRID:SCR_016368) [33]. These unique high-quality alignments 451 were remained for further analysis. HiCUP removes sequences representing experimental  artifacts and other uninformative di-tags, since even a small number of invalid di-tags could lead 453 to incorrect conclusions being drawn concerning genomic structure. 454 The genome was divided into 1 Mb bins, and the read pair numbers in two regions were 455 counted as the observed interactions. The observed matrix was normalized for GC content near 456 the ligated fragment ends, fragment lengths digested by HindIII and the mappability of the 457 fragment ends [34]. Through normalization, we obtained the expected interactions between every 458 two bins. The norm interactions were computed by observed interactions divided by expected 459 interactions. We used the norm interactions for every two bins to produce a norm contact matrix. samples.gtf samples.bam). The differentially expressed mRNAs were selected with fold change > 499 2 or fold change < 0.5 and p value < 0.05 using the R package edgeR [37] or DESeq2 and then 500 GO and KEGG enrichment to the differentially expressed mRNAs [38,39]. 501

Generation of inter-chromosomal contacts matrix 502
The expected number of inter-chromosomal interactions for each chromosome pair i,j was 503 computed by multiplying the fraction of inter-chromosomal reads containing i with the fraction of 504 inter-chromosomal reads containing j and multiplying by the total number of inter-chromosomal 505 reads. The enrichment was computed by taking the actual number of interactions observed 506 between i and j and dividing it by the expected value. 507 The inter-chromosomal contact possibility was computed by the observed read pairs between 508 chromosome pair i,j dividing it by its expected value. The expected number of inter-chromosomal 509 interactions for each chromosome pair i,j was calculated by multiplying the proportion of 510 inter-chromosomal reads containing i with the proportion of inter-chromosomal reads containing j 511 and the total number of inter-chromosomal reads. In addition, based on colinear mapping results, 512 comparative genome analysis between CRI-12 and other published upland genomes was 513 performed to investigate different types of structure variations (SVs). Enrichment analysis of 514 SV-related genes was also performed, including GO and KEGG enrichment. 515

Promoter associated interactions statistics 516
The Eukaryotic Promoter Database (EPD) is a collection of databases of experimentally validated 517 promoters for selected model organisms. We downloaded 21,239 mouse TSS sites from the EPD 518 database in the mm9 genome version. We considered the region from 1,000 bp upstream to 100b 519 downstream of the TSS site as the promoter region. These promoter sequences were subsequently 520 mapped to mm10 and retained unique alignment to obtain the promoter region in the new genome 521 version. After alignment, 21,226 promoters were left for further statistical analysis. We counted 522 the interaction numbers in promoter-promoter, promoter-other and other-other. 523

Annotation of protein coding genes 524
De novo, homolog-based and RNA-seq based predictions were employed to annotate the protein 525 coding genes in the CRI-12 genome. Five ab initio gene prediction programs were used to predict was obtained from the corresponding InterPro descriptions. Additionally, the gene set was mapped 548 to a KEGG [56] (release 53) pathway to identify the best match classification for each gene. 549 Finally, 72,293 protein coding genes (accounting for 99.30%) were functionally annotated. 550

Gene family cluster 555
Gene families were generated using OrthoMCL [60]. First, nucleotide and protein data of 5 556 species (upland cotton, island cotton, wool cotton, yellow brown cotton and Darwin's cotton) were 557 downloaded from the Ensembl (Release 70) and NCBI databases. Before an "all against all" 558 BLASTP (E-value ≤ 1E-07) program, the longest transcript was selected from alternative splicing 559 transcripts belonging to one gene and genes with ≤ 50 amino acids were then removed. 560 Alignments with high-scoring segment pairs (HSPs) were conjoined for each gene pair using solar 561 [61]. To identify homologous gene-pairs, a threshold of > 30% coverage of the aligned regions in 562 both homologous genes was required. Finally, alignments were clustered into gene families using 563 OrthoMCL utilizing a 1.5 inflation index. After clustering, 22,854 gene families were detected 564 across Kobo&Cbra and four other species. 565

Phylogenetic tree construction and divergence time estimation 566
Single-copy orthologs were utilized to construct the phylogenetic tree. CDS sequences of these among 14 species with main parameters of burn-in=100,000, sample-number=100,000, and 572 sample-frequency=2. Calibration points were selected and the TimeTree website was chosen as a 573 normal prior to restrain the age of the nodes. The split of Kobo was estimated, close to that 574 reported by others. 575

Gene family expansion and contraction 576
We determined the expansion and contraction of the gene families through comparing the cluster 577 size differences between the ancestor and each species using the CAFÉ program [66]. A random 578 birth and death model was used to study changes in gene families along each lineage of the 579 phylogenetic tree. A probabilistic graphical model (PGM) was introduced to calculate the 580 probability of transitions in gene family size from parent to child nodes in the phylogeny. Using 581 conditional likelihoods as the test statistics, we calculated the corresponding p-values in each 582 lineage. With a p-value of 0.05 used to identify gene families that were significantly expanded and 583 contracted. 584

Screening of positively selected genes in CRI-12 585
The CDS alignments of single-copy gene families were generated using MUSCLE [62]. Gblocks 586 [66] was applied to filter poorly aligned positions and divergent regions of the CDS alignments. 587 With Kobo and Rapi as foreground branches, positive selection sites were detected based on 588 branch-site models of PAML [65] using CDS alignments. P values were computed using the χ 2 589 statistic and adjusted by the FDR method. 590

Whole-genome duplication analysis 591
We used BLASTP (E-value < 1e-5) to perform a homolog search with the Kobo genome and 592 MCScanX was used to detect syntenic blocks. Then, Ks rates were calculated for all syntenic 593 genes to identify putative whole genome duplication events in Kobo. 594

Whole-genome DNA methylation analysis 595
High-quality genomic DNA was isolated and used for the construction of DNA methylation 596 library according to the previously described methods [67]. Methylation levels and differentially 597 methylated regions (DMRs) were obtained using swDMR software [68]. Based on the results of 598 haplotype block inheritance and recombination of agronomically important genes in CRI-12, 599 conjoint analysis was performed to discover methylation haplotypes.

Data availability 603
The sequencing data that support the findings of this study have been deposited in the CNGB 604 We are grateful to Hangzhou LC-Bio Technology Co., Ltd for assisting in sequencing and 641 bioinformatics analysis. 642

Additional files 643
Supplementary Figure